This repository has been archived by the owner on Aug 30, 2024. It is now read-only.

Add Fused-Attention Layer for AVX2 Platforms #137

Merged · 10 commits · Feb 26, 2024

Conversation

@DDEle DDEle commented Feb 22, 2024

Type of Change: feature

API not changed.

Description

As titled, this PR adds a fused-attention layer for AVX2 platforms. In addition, it adds `NE_ATTN_PREFER_FP32` to select the f32 compute type on AVX512 platforms.

How has this PR been tested?

Local tests and CI.

Windows 11 - 10th Gen Desktop

Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz | q4-j-f32-g128 | 1002-token-prompt

| | --memory-auto | --memory-f16 | main branch |
|---|---|---|---|
| first token | 21835.96ms | 26775.99ms | 28961.59ms |
| next token | 203.61ms | 205.32ms | 215.63ms |
Details:
(yi) PS C:\Users\dingyi\GitHub\neural-speed\neural_speed\build> cmake ../.. -G Ninja -B . -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=ON -DNS_PROFILING=ON ; cmake --build . -- run_llama ; powershell -Command { $env:NEURAL_SPEED_VERBOSE=0 ; $env:OMP_NUM_THREADS="" ; .\bin\run_llama.exe --seed 1234 -t 20 -b 1024 -c 1024 -m 'C:\Users\dingyi\Intel Corporation\ITREX - Documents\Runtime\Models\Llama-2-7b-chat-hf-pr136-q4-j-f32-g128.bin' --memory-auto -n 5 -p @(Get-Content  ~/LUOYU_PROMPT.txt | cut -d' ' -f 1-750 ) }
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] = 166.486 ms
perf_total_per_op_us[                     MUL] = 106.211 ms
perf_total_per_op_us[                RMS_NORM] =  99.114 ms
perf_total_per_op_us[                 MUL_MAT] = 2033.688 ms
perf_total_per_op_us[                 RESHAPE] =   0.022 ms
perf_total_per_op_us[                    VIEW] =   0.086 ms
perf_total_per_op_us[                 PERMUTE] =   0.006 ms
perf_total_per_op_us[               TRANSPOSE] =   0.005 ms
perf_total_per_op_us[                GET_ROWS] =   8.672 ms
perf_total_per_op_us[                    ROPE] = 182.861 ms
perf_total_per_op_us[                 MUL_QKV] = 4901.854 ms
perf_total_per_op_us[                FFN_SILU] = 12930.254 ms
perf_total_per_op_us[              FLASH_ATTN] = 882.787 ms
perf_total_per_op_us[    FLASH_ATTN_KV_UPDATE] = 521.047 ms
perf_total_per_op_us[           INNER PRODUCT] =   0.000 ms
========================================
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] =   0.316 ms
perf_total_per_op_us[                     MUL] =   0.256 ms
perf_total_per_op_us[                RMS_NORM] =   0.311 ms
perf_total_per_op_us[                 RESHAPE] =   0.026 ms
perf_total_per_op_us[                    VIEW] =   0.065 ms
perf_total_per_op_us[                 PERMUTE] =   0.005 ms
perf_total_per_op_us[               TRANSPOSE] =   0.004 ms
perf_total_per_op_us[                GET_ROWS] =   0.013 ms
perf_total_per_op_us[                    ROPE] =   0.596 ms
perf_total_per_op_us[                 MUL_QKV] =  45.981 ms
perf_total_per_op_us[                FFN_SILU] = 119.045 ms
perf_total_per_op_us[              FLASH_ATTN] =  20.475 ms
perf_total_per_op_us[    FLASH_ATTN_KV_UPDATE] =   2.085 ms
perf_total_per_op_us[           INNER PRODUCT] =  18.903 ms
========================================
model_print_timings:        load time = 21857.11 ms
model_print_timings:      sample time =     2.74 ms /     5 runs   (    0.55 ms per token)
model_print_timings: prompt eval time = 21835.96 ms /  1002 tokens (   21.79 ms per token)
model_print_timings:        eval time =   847.00 ms /     4 runs   (  211.75 ms per token)
model_print_timings:       total time = 23013.19 ms
========== eval time log of each prediction ==========
prediction   0, time: 21835.96ms
prediction   1, time: 214.34ms
prediction   2, time: 203.61ms
prediction   3, time: 218.20ms
prediction   4, time: 210.84ms


(yi) PS C:\Users\dingyi\GitHub\neural-speed\neural_speed\build> cmake ../.. -G Ninja -B . -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=ON -DNS_PROFILING=ON ; cmake --build . -- run_llama ; powershell -Command { $env:NEURAL_SPEED_VERBOSE=0 ; $env:OMP_NUM_THREADS="" ; .\bin\run_llama.exe --seed 1234 -t 20 -b 1024 -c 1024 -m 'C:\Users\dingyi\Intel Corporation\ITREX - Documents\Runtime\Models\Llama-2-7b-chat-hf-pr136-q4-j-f32-g128.bin' --memory-f16 -n 5 -p @(Get-Content  ~/LUOYU_PROMPT.txt | cut -d' ' -f 1-750 ) }
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] = 165.466 ms
perf_total_per_op_us[                     MUL] = 103.186 ms
perf_total_per_op_us[                RMS_NORM] =  98.146 ms
perf_total_per_op_us[                 MUL_MAT] = 7802.505 ms
perf_total_per_op_us[                   SCALE] = 393.494 ms
perf_total_per_op_us[                     CPY] = 298.809 ms
perf_total_per_op_us[                 RESHAPE] =   0.033 ms
perf_total_per_op_us[                    VIEW] =   0.057 ms
perf_total_per_op_us[                 PERMUTE] =   0.028 ms
perf_total_per_op_us[               TRANSPOSE] =   0.005 ms
perf_total_per_op_us[                GET_ROWS] =   8.649 ms
perf_total_per_op_us[           DIAG_MASK_INF] = 178.642 ms
perf_total_per_op_us[                SOFT_MAX] = 330.897 ms
perf_total_per_op_us[                    ROPE] = 171.783 ms
perf_total_per_op_us[                 MUL_QKV] = 4708.340 ms
perf_total_per_op_us[                FFN_SILU] = 12512.661 ms
perf_total_per_op_us[           INNER PRODUCT] =   0.000 ms
========================================
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] =   0.315 ms
perf_total_per_op_us[                     MUL] =   0.262 ms
perf_total_per_op_us[                RMS_NORM] =   0.312 ms
perf_total_per_op_us[                 MUL_MAT] =  21.666 ms
perf_total_per_op_us[                   SCALE] =   0.443 ms
perf_total_per_op_us[                     CPY] =   1.276 ms
perf_total_per_op_us[                 RESHAPE] =   0.022 ms
perf_total_per_op_us[                    VIEW] =   0.066 ms
perf_total_per_op_us[                 PERMUTE] =   0.017 ms
perf_total_per_op_us[               TRANSPOSE] =   0.006 ms
perf_total_per_op_us[                GET_ROWS] =   0.014 ms
perf_total_per_op_us[           DIAG_MASK_INF] =   0.028 ms
perf_total_per_op_us[                SOFT_MAX] =   0.836 ms
perf_total_per_op_us[                    ROPE] =   0.651 ms
perf_total_per_op_us[                 MUL_QKV] =  44.320 ms
perf_total_per_op_us[                FFN_SILU] = 113.190 ms
perf_total_per_op_us[           INNER PRODUCT] =  18.770 ms
========================================
model_print_timings:        load time = 26796.92 ms
model_print_timings:      sample time =     2.46 ms /     5 runs   (    0.49 ms per token)
model_print_timings: prompt eval time = 26775.99 ms /  1002 tokens (   26.72 ms per token)
model_print_timings:        eval time =   846.46 ms /     4 runs   (  211.62 ms per token)
model_print_timings:       total time = 27949.95 ms
========== eval time log of each prediction ==========
prediction   0, time: 26775.99ms
prediction   1, time: 210.62ms
prediction   2, time: 211.09ms
prediction   3, time: 219.42ms
prediction   4, time: 205.32ms


(yi) PS C:\Users\dingyi\GitHub\neural-speed\neural_speed\build> git checkout origin/main ; cmake ../.. -G Ninja -B . -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=OFF -DNS_PROFILING=ON ; cmake --build . -- run_llama ; powershell -Command { $env:NEURAL_SPEED_VERBOSE=0 ; $env:OMP_NUM_THREADS="" ; .\bin\run_llama.exe --seed 1234 -t 20 -b 1024 -c 1024 -m 'C:\Users\dingyi\Intel Corporation\ITREX - Documents\Runtime\Models\Llama-2-7b-chat-hf-pr136-q4-j-f32-g128.bin' --memory-f16 -n 5 -p @(Get-Content  ~/LUOYU_PROMPT.txt | cut -d' ' -f 1-750 ) }
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] = 164.695 ms
perf_total_per_op_us[                     MUL] = 104.046 ms
perf_total_per_op_us[                RMS_NORM] =  98.029 ms
perf_total_per_op_us[                 MUL_MAT] = 8282.626 ms
perf_total_per_op_us[                   SCALE] = 414.399 ms
perf_total_per_op_us[                     CPY] = 332.220 ms
perf_total_per_op_us[                 RESHAPE] =   0.026 ms
perf_total_per_op_us[                    VIEW] =   0.072 ms
perf_total_per_op_us[                 PERMUTE] =   0.032 ms
perf_total_per_op_us[               TRANSPOSE] =   0.005 ms
perf_total_per_op_us[                GET_ROWS] =   8.580 ms
perf_total_per_op_us[           DIAG_MASK_INF] = 173.903 ms
perf_total_per_op_us[                SOFT_MAX] = 334.689 ms
perf_total_per_op_us[                    ROPE] = 201.331 ms
perf_total_per_op_us[                 MUL_QKV] = 5128.779 ms
perf_total_per_op_us[                FFN_SILU] = 13714.927 ms
perf_total_per_op_us[           INNER PRODUCT] =   0.000 ms
========================================
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] =   0.328 ms
perf_total_per_op_us[                     MUL] =   0.258 ms
perf_total_per_op_us[                RMS_NORM] =   0.325 ms
perf_total_per_op_us[                 MUL_MAT] =  22.338 ms
perf_total_per_op_us[                   SCALE] =   0.444 ms
perf_total_per_op_us[                     CPY] =   1.245 ms
perf_total_per_op_us[                 RESHAPE] =   0.013 ms
perf_total_per_op_us[                    VIEW] =   0.069 ms
perf_total_per_op_us[                 PERMUTE] =   0.024 ms
perf_total_per_op_us[               TRANSPOSE] =   0.005 ms
perf_total_per_op_us[                GET_ROWS] =   0.014 ms
perf_total_per_op_us[           DIAG_MASK_INF] =   0.030 ms
perf_total_per_op_us[                SOFT_MAX] =   0.887 ms
perf_total_per_op_us[                    ROPE] =   0.568 ms
perf_total_per_op_us[                 MUL_QKV] =  44.660 ms
perf_total_per_op_us[                FFN_SILU] = 130.157 ms
perf_total_per_op_us[           INNER PRODUCT] =  19.817 ms
========================================
model_print_timings:        load time = 28982.10 ms
model_print_timings:      sample time =     2.45 ms /     5 runs   (    0.49 ms per token)
model_print_timings: prompt eval time = 28961.59 ms /  1002 tokens (   28.90 ms per token)
model_print_timings:        eval time =   890.80 ms /     4 runs   (  222.70 ms per token)
model_print_timings:       total time = 29877.89 ms
========== eval time log of each prediction ==========
prediction   0, time: 28961.59ms
prediction   1, time: 218.51ms
prediction   2, time: 232.21ms
prediction   3, time: 215.63ms
prediction   4, time: 224.45ms

Ubuntu - MTL

gtax (Client 3259) | q4-j-int8-g128 | 1332-token-prompt

| | --memory-auto | --memory-f16 | main branch |
|---|---|---|---|
| first token | 25045.54ms | 40714.51ms | 41992.02ms |
| next token | 202.26ms | 194.63ms | 199.88ms |
Details:
(yi) gta@DUT1225MTLS:~/neural-speed/neural_speed/build$ cmake ../.. -GNinja -DNS_TP=OFF -DNS_PROFILING=ON -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=ON && cmake --build . -- run_llama && env NEURAL_SPEED_VERBOSE=0 bin/run_llama -m ~/Llama-2-7b-chat-hf-pr136-q4-j-int8-g128.bin --seed 1 -t 14 -n 5 -c 2048 -b 2048 -p "$(echo $LUOYU_PROMPT|cut -d' ' -f 1-1000)" --memory-auto
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] = 168.061 ms
perf_total_per_op_us[                     MUL] =  78.413 ms
perf_total_per_op_us[                RMS_NORM] =  76.767 ms
perf_total_per_op_us[                 MUL_MAT] = 2383.484 ms
perf_total_per_op_us[                 RESHAPE] =   0.059 ms
perf_total_per_op_us[                    VIEW] =   0.243 ms
perf_total_per_op_us[                 PERMUTE] =   0.019 ms
perf_total_per_op_us[               TRANSPOSE] =   0.015 ms
perf_total_per_op_us[                GET_ROWS] =  13.225 ms
perf_total_per_op_us[                    ROPE] = 224.337 ms
perf_total_per_op_us[                 MUL_QKV] = 4972.893 ms
perf_total_per_op_us[                FFN_SILU] = 12881.324 ms
perf_total_per_op_us[              FLASH_ATTN] = 3571.979 ms
perf_total_per_op_us[    FLASH_ATTN_KV_UPDATE] = 672.456 ms
perf_total_per_op_us[           INNER PRODUCT] =   0.000 ms
========================================
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] =   0.303 ms
perf_total_per_op_us[                     MUL] =   0.239 ms
perf_total_per_op_us[                RMS_NORM] =   0.292 ms
perf_total_per_op_us[                 RESHAPE] =   0.058 ms
perf_total_per_op_us[                    VIEW] =   0.172 ms
perf_total_per_op_us[                 PERMUTE] =   0.017 ms
perf_total_per_op_us[               TRANSPOSE] =   0.020 ms
perf_total_per_op_us[                GET_ROWS] =   0.009 ms
perf_total_per_op_us[                    ROPE] =   1.317 ms
perf_total_per_op_us[                 MUL_QKV] =  49.494 ms
perf_total_per_op_us[                FFN_SILU] = 111.426 ms
perf_total_per_op_us[              FLASH_ATTN] =  31.079 ms
perf_total_per_op_us[    FLASH_ATTN_KV_UPDATE] =   2.893 ms
perf_total_per_op_us[           INNER PRODUCT] =  18.746 ms
========================================
model_print_timings:        load time = 25046.31 ms
model_print_timings:      sample time =     3.82 ms /     5 runs   (    0.76 ms per token)
model_print_timings: prompt eval time = 25045.53 ms /  1332 tokens (   18.80 ms per token)
model_print_timings:        eval time =   844.85 ms /     4 runs   (  211.21 ms per token)
model_print_timings:       total time = 25969.81 ms
========== eval time log of each prediction ==========
prediction   0, time: 25045.54ms
prediction   1, time: 218.55ms
prediction   2, time: 202.26ms
prediction   3, time: 206.46ms
prediction   4, time: 217.57ms


(yi) gta@DUT1225MTLS:~/neural-speed/neural_speed/build$ cmake ../.. -GNinja -DNS_TP=OFF -DNS_PROFILING=ON -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=ON && cmake --build . -- run_llama && env NEURAL_SPEED_VERBOSE=0 bin/run_llama -m ~/Llama-2-7b-chat-hf-pr136-q4-j-int8-g128.bin --seed 1 -t 14 -n 5 -c 2048 -b 2048 -p "$(echo $LUOYU_PROMPT|cut -d' ' -f 1-1000)" --memory-f16
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] = 146.843 ms
perf_total_per_op_us[                     MUL] =  94.579 ms
perf_total_per_op_us[                RMS_NORM] =  80.229 ms
perf_total_per_op_us[                 MUL_MAT] = 19673.163 ms
perf_total_per_op_us[                   SCALE] = 722.337 ms
perf_total_per_op_us[                     CPY] = 873.415 ms
perf_total_per_op_us[                 RESHAPE] =   0.100 ms
perf_total_per_op_us[                    VIEW] =   0.250 ms
perf_total_per_op_us[                 PERMUTE] =   0.105 ms
perf_total_per_op_us[               TRANSPOSE] =   0.025 ms
perf_total_per_op_us[                GET_ROWS] =  13.086 ms
perf_total_per_op_us[           DIAG_MASK_INF] = 154.136 ms
perf_total_per_op_us[                SOFT_MAX] = 541.224 ms
perf_total_per_op_us[                    ROPE] = 328.229 ms
perf_total_per_op_us[                 MUL_QKV] = 5061.945 ms
perf_total_per_op_us[                FFN_SILU] = 13022.054 ms
perf_total_per_op_us[           INNER PRODUCT] =   0.000 ms
========================================
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] =   0.294 ms
perf_total_per_op_us[                     MUL] =   0.242 ms
perf_total_per_op_us[                RMS_NORM] =   0.286 ms
perf_total_per_op_us[                 MUL_MAT] =  23.629 ms
perf_total_per_op_us[                   SCALE] =   0.361 ms
perf_total_per_op_us[                     CPY] =   1.567 ms
perf_total_per_op_us[                 RESHAPE] =   0.065 ms
perf_total_per_op_us[                    VIEW] =   0.154 ms
perf_total_per_op_us[                 PERMUTE] =   0.051 ms
perf_total_per_op_us[               TRANSPOSE] =   0.017 ms
perf_total_per_op_us[                GET_ROWS] =   0.008 ms
perf_total_per_op_us[           DIAG_MASK_INF] =   0.034 ms
perf_total_per_op_us[                SOFT_MAX] =   1.108 ms
perf_total_per_op_us[                    ROPE] =   0.744 ms
perf_total_per_op_us[                 MUL_QKV] =  38.141 ms
perf_total_per_op_us[                FFN_SILU] = 129.775 ms
perf_total_per_op_us[           INNER PRODUCT] =  18.884 ms
========================================
model_print_timings:        load time = 40715.27 ms
model_print_timings:      sample time =     3.76 ms /     5 runs   (    0.75 ms per token)
model_print_timings: prompt eval time = 40714.51 ms /  1332 tokens (   30.57 ms per token)
model_print_timings:        eval time =   868.72 ms /     4 runs   (  217.18 ms per token)
model_print_timings:       total time = 41651.98 ms
========== eval time log of each prediction ==========
prediction   0, time: 40714.51ms
prediction   1, time: 194.63ms
prediction   2, time: 233.76ms
prediction   3, time: 223.09ms
prediction   4, time: 217.24ms


(yi) gta@DUT1225MTLS:~/neural-speed/neural_speed/build$ git checkout origin/main && cmake ../.. -GNinja -DNS_TP=OFF -DNS_PROFILING=ON -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=OFF && cmake --build . -- run_llama && env NEURAL_SPEED_VERBOSE=0 bin/run_llama -m ~/Llama-2-7b-chat-hf-pr136-q4-j-int8-g128.bin --seed 1 -t 14 -n 5 -c 2048 -b 2048 -p "$(echo $LUOYU_PROMPT|cut -d' ' -f 1-1000)" --memory-f16
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] = 154.251 ms
perf_total_per_op_us[                     MUL] =  79.162 ms
perf_total_per_op_us[                RMS_NORM] =  89.996 ms
perf_total_per_op_us[                 MUL_MAT] = 20255.021 ms
perf_total_per_op_us[                   SCALE] = 1253.678 ms
perf_total_per_op_us[                     CPY] = 743.506 ms
perf_total_per_op_us[                 RESHAPE] =   0.142 ms
perf_total_per_op_us[                    VIEW] =   0.286 ms
perf_total_per_op_us[                 PERMUTE] =   0.142 ms
perf_total_per_op_us[               TRANSPOSE] =   0.027 ms
perf_total_per_op_us[                GET_ROWS] =  35.328 ms
perf_total_per_op_us[           DIAG_MASK_INF] = 151.711 ms
perf_total_per_op_us[                SOFT_MAX] = 542.413 ms
perf_total_per_op_us[                    ROPE] = 326.503 ms
perf_total_per_op_us[                 MUL_QKV] = 5146.380 ms
perf_total_per_op_us[                FFN_SILU] = 13207.853 ms
perf_total_per_op_us[           INNER PRODUCT] =   0.000 ms
========================================
...
=== GRAPH Profiling ===
perf_total_per_op_us[                     ADD] =   0.363 ms
perf_total_per_op_us[                     MUL] =   0.304 ms
perf_total_per_op_us[                RMS_NORM] =   0.487 ms
perf_total_per_op_us[                 MUL_MAT] =  23.838 ms
perf_total_per_op_us[                   SCALE] =   0.626 ms
perf_total_per_op_us[                     CPY] =   1.593 ms
perf_total_per_op_us[                 RESHAPE] =   0.087 ms
perf_total_per_op_us[                    VIEW] =   0.161 ms
perf_total_per_op_us[                 PERMUTE] =   0.066 ms
perf_total_per_op_us[               TRANSPOSE] =   0.018 ms
perf_total_per_op_us[                GET_ROWS] =   0.016 ms
perf_total_per_op_us[           DIAG_MASK_INF] =   0.035 ms
perf_total_per_op_us[                SOFT_MAX] =   1.066 ms
perf_total_per_op_us[                    ROPE] =   0.727 ms
perf_total_per_op_us[                 MUL_QKV] =  38.059 ms
perf_total_per_op_us[                FFN_SILU] = 110.460 ms
perf_total_per_op_us[           INNER PRODUCT] =  18.224 ms
========================================
model_print_timings:        load time = 41992.96 ms
model_print_timings:      sample time =     8.49 ms /     5 runs   (    1.70 ms per token)
model_print_timings: prompt eval time = 41992.02 ms /  1332 tokens (   31.53 ms per token)
model_print_timings:        eval time =   812.54 ms /     4 runs   (  203.14 ms per token)
model_print_timings:       total time = 42816.34 ms
========== eval time log of each prediction ==========
prediction   0, time: 41992.02ms
prediction   1, time: 198.25ms
prediction   2, time: 199.88ms
prediction   3, time: 214.47ms
prediction   4, time: 199.94ms

Accuracy (Ubuntu - MTL)

LGTM: the 32-token continuations below are identical across --memory-auto, --memory-f16, and the main branch.
# --memory-auto
(yi) gta@DUT1225MTLS:~/neural-speed/neural_speed/build$ cmake ../.. -GNinja -DNS_TP=OFF -DNS_PROFILING=ON -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=OFF && cmake --build . -- run_llama && env NEURAL_SPEED_VERBOSE=-1 bin/run_llama -m ~/Llama-2-7b-chat-hf-pr136-q4-j-int8-g128.bin --seed 1 -t 14 -n 32 -c 2048 -b 2048 -p "$GIRL_PROMPT"
Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. Oneday,she heard about a magic mirror, which could grant her wishes, and make her dreams come true. Excited, she went to

# --memory-f16
(yi) gta@DUT1225MTLS:~/neural-speed/neural_speed/build$ cmake ../.. -GNinja -DNS_TP=OFF -DNS_PROFILING=ON -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=OFF && cmake --build . -- run_llama && env NEURAL_SPEED_VERBOSE=-1 bin/run_llama -m ~/Llama-2-7b-chat-hf-pr136-q4-j-int8-g128.bin --seed 1 -t 14 -n 32 -c 2048 -b 2048 -p "$GIRL_PROMPT" --memory-f16
Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. Oneday,she heard about a magic mirror, which could grant her wishes, and make her dreams come true. Excited, she went to

# main branch
(yi) gta@DUT1225MTLS:~/neural-speed/neural_speed/build$ git checkout origin/main && cmake ../.. -GNinja -DNS_TP=OFF -DNS_PROFILING=ON -DCMAKE_BUILD_TYPE=Release -DNS_BUILD_TESTS=OFF && cmake --build . -- run_llama && env NEURAL_SPEED_VERBOSE=-1 bin/run_llama -m ~/Llama-2-7b-chat-hf-pr136-q4-j-int8-g128.bin --seed 1 -t 14 -n 32 -c 2048 -b 2048 -p "$GIRL_PROMPT"
Once upon a time, there existed a little girl, who liked to have adventures. She wanted to go to places and meet new people, and have fun. Oneday,she heard about a magic mirror, which could grant her wishes, and make her dreams come true. Excited, she went to

Dependency Change?

N/A

@DDEle DDEle marked this pull request as ready for review February 22, 2024 07:25
Review threads on neural_speed/core/layers/mha_dense.cpp (outdated; resolved).
@luoyu-intel (Contributor) left a comment:

Better to make the mha_dense files as simple as ffn or qkv. They are too complicated with these compiler flags and intrinsic codes.

@DDEle (Contributor, Author) commented Feb 22, 2024

> Better to make the mha_dense files as simple as ffn or qkv. They are too complicated with these compiler flags and intrinsic codes.

  • moved btla templates to mha_dense_wrapper.h in c199f76
  • moved tests to mha_dense_tests.cpp in 2d3f4e4

@DDEle DDEle requested a review from luoyu-intel February 22, 2024 11:54
@airMeng airMeng merged commit bc5ee16 into intel:main Feb 26, 2024
12 checks passed
@yuchengliu1 yuchengliu1 mentioned this pull request Mar 15, 2024
@kevinintel kevinintel mentioned this pull request Mar 21, 2024